# Vision-Language Models
Test Push
Apache-2.0
distilvit is an image-to-text model based on a ViT image encoder and a distilled GPT-2 text decoder that generates textual descriptions of images.
Image-to-Text
Transformers

tarekziade
17
0
MMICL Instructblip T5 Xxl
MIT
MMICL is a multimodal vision-language model combining BLIP-2/InstructBLIP that can analyze and understand multiple images while following instructions.
Image-to-Text
Transformers, English

BleachNick
156
11